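Large-scale multi-modal contrastive pre-training has been shown to learn transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space. Typically, this has employed a separate encoder for each modality. However, recent work suggests that Transformers can support learning across multiple modalities and allow knowledge sharing. Inspired by this, we investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. More specifically, we ask how many parameters of a Transformer model can be shared across modalities during contrastive pre-training, and rigorously examine architectural design choices that position the proportion of shared parameters along a spectrum. Under the conditions studied, we observe that a mostly unified encoder for vision and language signals outperforms all other variants that separate more parameters. Additionally, we find that modality-specific parallel modules further improve performance. Experimental results show that the proposed MS-CLIP approach outperforms vanilla CLIP by up to 13% relative in zero-shot classification (pre-trained on YFCC-100M), while also supporting a reduction in parameters. In addition, our approach outperforms vanilla CLIP in linear probing on a collection of 24 downstream vision tasks. Furthermore, we find that sharing parameters leads to semantic concepts from different modalities being encoded more closely in the embedding space, facilitating the transfer of common semantic structure (e.g., attention patterns) from language to vision. Code is available at https://github.com/hxyou/msclip.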
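As a rough illustration of the contrastive objective that CLIP-style pre-training builds on, the sketch below computes a symmetric cross-entropy loss over image-text similarity scores in a shared embedding space. It is a minimal pure-Python sketch, not the MS-CLIP implementation: the function names, the toy embeddings, and the fixed temperature of 0.07 are assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_emb[i] and text_emb[i] are assumed to form a matching pair.
    The temperature value is an illustrative assumption.
    """
    images = [l2_normalize(v) for v in image_emb]
    texts = [l2_normalize(v) for v in text_emb]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed)
    )


# Toy batch: matched pairs point in the same direction, so the loss is small.
aligned_loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Swapping the texts makes every pair mismatched, so the loss is large.
swapped_loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

In this formulation the loss does not depend on whether the two sets of embeddings come from separate encoders or from one encoder with shared parameters; only the embeddings enter the computation.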
translated by Google Translate
Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space, yielding tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference or pretraining complexity? In this work, we seek to answer these questions through two key contributions. First, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data availability constraints and conditions of domain shift. Second, we propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures using a dynamically weighted objective applied to adaptively selected tokens per instance. Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only. On SNLI-VE, CLIP-TD produces significant gains in low-shot conditions (up to 6.6%) as well as in fully-supervised conditions (up to 3%). On VQA, CLIP-TD provides improvement in low-shot (up to 9%) and in fully-supervised (up to 1.3%) conditions. Finally, CLIP-TD outperforms concurrent works utilizing CLIP for finetuning, as well as baseline naive distillation approaches. Code will be made available.
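Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual concepts. Our method leverages a CLIP model to match image regions with template captions, and then pretrains our model to align these region-text pairs in the feature space. When transferring our pretrained model to open-vocabulary object detection tasks, our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets, respectively. Moreover, the learned region representations support zero-shot inference for object detection, showing promising results on both the COCO and LVIS datasets. Our code is available at https://github.com/microsoft/regionclip.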
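Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission of solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping image and text representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted to various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results on 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.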
We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.
Crop type maps are critical for tracking agricultural land use and estimating crop production. Remote sensing has proven an efficient and reliable tool for creating these maps in regions with abundant ground labels for model training, yet these labels remain difficult to obtain in many regions and years. NASA's Global Ecosystem Dynamics Investigation (GEDI) spaceborne lidar instrument, originally designed for forest monitoring, has shown promise for distinguishing tall and short crops. In the current study, we leverage GEDI to develop wall-to-wall maps of short vs tall crops on a global scale at 10 m resolution for 2019-2021. Specifically, we show that (1) GEDI returns can reliably be classified into tall and short crops after removing shots with extreme view angles or topographic slope, (2) the frequency of tall crops over time can be used to identify months when tall crops are at their peak height, and (3) GEDI shots in these months can then be used to train random forest models that use Sentinel-2 time series to accurately predict short vs. tall crops. Independent reference data from around the world are then used to evaluate these GEDI-S2 maps. We find that GEDI-S2 performed nearly as well as models trained on thousands of local reference training points, with accuracies of at least 87% and often above 90% throughout the Americas, Europe, and East Asia. Systematic underestimation of tall crop area was observed in regions where crops frequently exhibit low biomass, namely Africa and South Asia, and further work is needed in these systems. Although the GEDI-S2 approach only differentiates tall from short crops, in many landscapes this distinction goes a long way toward mapping the main individual crop types. The combination of GEDI and Sentinel-2 thus presents a very promising path towards global crop mapping with minimal reliance on ground data.
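In the field of dynamic legged locomotion, achieving stable hopping has long been a hallmark challenge. Controlled hopping is notably difficult due to extended periods of underactuation, combined with very short ground phases during which ground interactions must be modulated to regulate the global state. In this work, we explore the use of hybrid nonlinear model predictive control, paired with a low-level feedback controller in a multi-rate hierarchy, to achieve dynamically stable motions on a novel 3D hopping robot. To demonstrate richer behaviors on the manifold of rotations, both the planning and feedback layers must be designed in a geometrically consistent fashion. We therefore develop the necessary tools to employ Lie group integrators and an appropriate feedback controller. We experimentally demonstrate stable 3D hopping on a novel robot, as well as trajectory tracking and flipping in simulation.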
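The ability to generate robust walking gaits on bipedal robots is key to their successful realization on hardware. To this end, this work extends the method of Hybrid Zero Dynamics (HZD), which traditionally only accounts for locomotive stability via periodicity constraints under perfect impact events, through the inclusion of the saltation matrix, with a view toward synthesizing robust walking gaits. By jointly minimizing the norm of the extended saltation matrix and the torque of the robot directly in the gait generation process, we show that the synthesized gaits are more robust than gaits generated with either term alone. These results are shown in simulation and on hardware for the AMBER-3M planar biped and the Atalante lower-body exoskeleton (both with and without a human subject). The end result is experimental validation that combining saltation matrices with HZD methods produces more robust bipedal walking in practice.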
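Methods that are resilient to artifacts in cardiac magnetic resonance imaging (MRI) when performing ventricle segmentation are crucial for ensuring quality in the structural and functional analysis of those tissues. While significant effort has gone into improving the quality of the algorithms, few works have tackled the harm that artifacts cause in the predictions. In this work, we study fine-tuning of pretrained networks to improve the resilience of previous methods to these artifacts. In our proposed method, we make extensive use of data augmentations that mimic those artifacts. The results significantly improve on the baseline segmentations (improvements of up to 0.06 in Dice score and 4 mm in Hausdorff distance).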
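For readers unfamiliar with the Dice score quoted above, the sketch below computes it for two binary segmentation masks, using the standard definition of twice the overlap divided by the total foreground size of both masks. It is a minimal illustration, not the evaluation code behind the reported results: the function name, the flat-list mask representation, and the convention for two empty masks are assumptions.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total number of foreground pixels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 predicted pixels, 3 reference pixels, 2 of them shared.
predicted_mask = [0, 1, 1, 1, 0, 0]
reference_mask = [0, 1, 1, 0, 0, 1]
print(dice_score(predicted_mask, reference_mask))
```

On this scale, where 0 means no overlap and 1 means identical masks, an improvement of 0.06 is a shift of six percentage points in overlap.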
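In recent years, robotic manipulation and control have grown in importance. However, state-of-the-art techniques still have limitations when required to operate in real-world applications. This paper explores Hindsight Experience Replay in both simulated and real environments, highlighting its weaknesses and proposing reinforcement-learning-based alternatives built on reward and goal shaping. In addition, several research questions are identified, along with potential research directions that could be explored to address them.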